CD-ROMs containing patent information and technical information are now available in many countries and in many languages. For example, in 1993 the Japanese Patent Office began publishing its patent application bulletins on CD-ROM. With the publication of these CD-ROMs, CD-ROM search systems for laid-open gazettes have spread rapidly. Such a search system retrieves a desired gazette from the patent application bulletins recorded on the CD-ROM. The retrieved gazette can be displayed, printed out, and output as a text file. The text file can be used, for example, as the original text for translation by a machine translation system.
Incidentally, the patent application bulletins recorded on the CD-ROMs published by the Japanese Patent Office contain a variety of control information in addition to text information and image information. This control information includes various codes which cannot be converted into text information and are automatically converted into "=". Tags indicating where image data or user-registered characters are to be inserted, tags indicating the layout, and other tags indicating the logical structure are unnecessary garbage data when only the text information is needed.
For example, when a front page file of a gazette recorded on the CD-ROM is output, the image data in the file is deleted, leaving a blank page with irrelevant marks such as the indication mark [frame 1] or the continuation mark "===". Moreover, a blank page in the original file may contain a plurality of line-break codes, and sentences in the text may end with a plurality of spaces. It is thus difficult to output the original file in a clean, well-formatted style.
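The clean-up that these defects call for can be illustrated with a minimal sketch. It assumes the residue appears literally as marks of the form [frame N] and as runs of three or more "=" characters; the patterns and the function name `clean_gazette_text` are hypothetical and are not taken from any actual search system.

```python
import re

# Assumed patterns for the residue described above (hypothetical).
FRAME_MARK = re.compile(r"\[frame \d+\]")  # indication marks such as [frame 1]
EQUALS_RUN = re.compile(r"={3,}")          # continuation marks such as "==="


def clean_gazette_text(raw: str) -> str:
    """Remove stray marks, trailing spaces, and repeated blank lines."""
    text = FRAME_MARK.sub("", raw)
    text = EQUALS_RUN.sub("", text)

    cleaned = []
    for line in text.splitlines():
        line = line.rstrip()  # drop spaces left at the end of the line
        # Collapse a run of blank lines into a single blank line.
        if line == "" and cleaned and cleaned[-1] == "":
            continue
        cleaned.append(line)

    # Drop blank lines left at the very start or end of the text.
    return "\n".join(cleaned).strip("\n")
```

A single "=" is deliberately left in place here, because it stands for a code that could not be converted and may mark a position the reader needs to check.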
Because of this irregular style, the text files output by the search system must be revised one by one, which increases labor cost and inconvenience. Moreover, the above search system outputs one file for each gazette, so that outputting a number of front pages produces a number of text files. Linking these text files accordingly requires repeated manual labor.
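The linking step described above could be automated along the following lines. This is a sketch only: the function names `link_texts` and `link_files`, the blank-line separator, and the UTF-8 default encoding are assumptions made for illustration, and files taken from an actual CD-ROM may use a different encoding.

```python
from pathlib import Path
from typing import Iterable


def link_texts(texts: Iterable[str], separator: str = "\n\n") -> str:
    """Join per-gazette texts into one text, skipping empty ones."""
    parts = [t.strip() for t in texts]
    return separator.join(p for p in parts if p)


def link_files(paths: Iterable[Path], out_path: Path,
               encoding: str = "utf-8") -> None:
    """Read each per-gazette file in order and write one linked file."""
    texts = [Path(p).read_text(encoding=encoding) for p in paths]
    Path(out_path).write_text(link_texts(texts) + "\n", encoding=encoding)
```

For example, `link_files(sorted(Path("out").glob("*.txt")), Path("linked.txt"))` would link every text file in a hypothetical `out` directory in file-name order.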
In addition, patent specifications too often contain inconsistent expressions, departures from correct grammar, and long sentences to be translated correctly into another language. These are serious problems for machine translation. Pre-editing for machine translation therefore requires labor to convert the words and sentences into forms suited to the machine translation process. Alternatively, parentheses "( )" or other marks would be inserted into the original text, or another procedure would be taken, to point out problematic places in the document. Regardless of which method is chosen, however, patent specifications have required tremendous labor cost and time before translation. A solution for translating a large quantity of patent gazettes quickly and easily has long been awaited.